1. INTRODUCTION

Electronic health information and electronic health applications are now available in large 
quantity [1]. Users of health information, such as health care providers, researchers and analysts, use this 
data to draw inferences [2]. Because health records contain patients' private data, access to them is 
restricted. Privacy preservation techniques make such access possible. Electronic health records are 
useful for communication and for keeping patient information intact. The growing demand for large volumes 
of electronic health data has increased patients' privacy concerns [3]. De-identification techniques are used 
to provide privacy for electronic health data. These techniques remove direct identifiers that can expose 
the identity of an individual or disclose sensitive information about that individual. They work by 
suppression, generalization or replacement of the identifiers [4-5]. 

Different countries have laws that protect the privacy of electronic health data [2]. 
In the USA, the Health Insurance Portability and Accountability Act (HIPAA), the Patient Safety and Quality 
Improvement Act (PSQIA) and the HITECH Act protect the privacy of electronic health data. The Data 
Protection Act (DPA) in the UK gives individuals options for protecting their information. The Russian 
Federal Law on Personal Data requires organizations to obtain all permissions before handing health 
data over to others. The Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada 
gives citizens the right to know the reasons behind the collection of their private data [6]. The IT Act and 
IT (Amendment) Act in India prescribe strict penalties, such as imprisonment or a fine, for misusing personal 
information. The Data Protection Directive in the European Union protects people's fundamental rights with 
respect to access to personal data. 

De-identification methods are used in the anonymization of electronic health data. These methods 
include K-anonymity, L-diversity and T-closeness [7-9]. L-diversity is open to the 
similarity attack. Figure 1 shows the architecture of delay-free anonymization for privacy preservation [10]. 
Input data arrives from the source as tuples. Each tuple is divided into two parts. 
Figure 1. Delay free anonymization: Architectural diagram 


The first part contains the quasi-identifier and the group number. The second part contains the sensitive 
value along with its L-1 counterfeit values, the count of each sensitive value and the group number. Adding 
L-1 counterfeit values to the real value makes it difficult to disclose the sensitive value. These counterfeit 
values are validated against the upcoming input tuples, and a groupwise count of released tuples is 
maintained. Similarity among the counterfeit values weakens privacy. It can be avoided by replacing similar 
values in the group. 
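
The split described above can be illustrated with a minimal Python sketch. It is not the implementation from [10]; the function name `anonymize_tuple`, the record layout and the example domain are assumptions made only for illustration, and late validation against upcoming tuples is left out.

```python
import random

def anonymize_tuple(record, quasi_keys, sensitive_key, domain, l, group_id, rng):
    """Split one record into a quasi-identifier part and a sensitive part.

    Hypothetical helper: the sensitive part holds the real value hidden
    among L-1 counterfeit values drawn from the sensitive domain.
    """
    real = record[sensitive_key]
    # Candidates exclude the real value so the group holds L distinct values.
    candidates = [v for v in domain if v != real]
    # rng.sample raises ValueError if the domain has fewer than L-1 other values.
    counterfeits = rng.sample(candidates, l - 1)
    values = counterfeits + [real]
    rng.shuffle(values)  # do not reveal the real value by its position
    qi_part = {"group": group_id}
    qi_part.update({k: record[k] for k in quasi_keys})
    sensitive_part = {"group": group_id, "counts": {v: 1 for v in values}}
    return qi_part, sensitive_part

rng = random.Random(0)  # fixed seed so the sketch is repeatable
domain = ["dengue", "leprosy", "malaria", "diphtheria", "influenza"]
record = {"age": 34, "zip": "411001", "disease": "malaria"}
qi, st = anonymize_tuple(record, ["age", "zip"], "disease", domain, 3, 1, rng)
```

Both parts carry the same group number, so the published quasi-identifier part can only be linked to a set of L candidate sensitive values rather than to the real one.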


2. RELATED WORK 

Guidelines are needed to protect privacy against invasive marketing and inadvertent 
privacy disclosure [11]. Big data operators that share data need scalable privacy 
preserving algorithms to protect their datasets. Health information providers can benefit from a cost-profit 
model when deciding whether to share health related data with other parties [12]. Privacy requirements 
are important in big data collection, storage, and intra- and inter-organization processing. Privacy 
preserving aggregation, encrypted data operations and de-identification techniques have been suggested for 
computing on big data in a privacy preserving way [13]. In data privacy, it is necessary to understand the 
privacy requirements at the data provider, data collector, data miner and decision maker stages [14]. Keeping 
the source or origin of data is important for identifying a privacy attack. In [10], a delay-free anonymization 
technique is used to reduce delay and increase data utility through late validation. 

Distributed stream processing has been achieved by extending Storm's capabilities for task management, 
scheduling and distributed execution [15]. The DART system proposes a framework for different devices 
present at remote sites in a distributed environment. This framework provides registration and 
authorization of devices at the remote site, task allocation, and management of user applications. In this 
system the computation load is reduced by utilizing idle resources [16]. Distributed stream processing 
systems have different availability requirements for different applications. When one of the nodes in a 
distributed environment fails, the backup or secondary server resumes the execution. While doing so, the 
state must be maintained. The choice of recovery technique and its performance depend on the stream 
processing application [17]. Newer stream processing systems exploit tasks instead of nodes for fault 
tolerance [18]. 


3. RESEARCH METHOD 
3.1. The Need and Importance of the Problem 

Electronic health data is produced in large quantity. When anonymizing this data, minimum 
execution time and low information loss are important. Anonymization delay is minimized using the delay-free 
framework. To avoid a similarity attack on an L-diverse counterfeit group, similar values must be 
replaced. Because electronic health data contains a large number of tuples, similar 
groups may form and disclose the sensitive value. Repetition of such values in a group is avoided by 
forming synthetic values. The complexity of big electronic health data is a challenge for existing privacy 
preserving algorithms, which cannot work on large datasets. 

As shown in Figure 2, the similarity index of each group is calculated to avoid a similarity attack [19]. If 
some values are similar, they are replaced with other values taken from past data. Under the policy that the 
past reflects the future, values are selected from past data so that late validation of the counterfeit values 
in the group happens early. The information loss and utility of the replaced data are calculated and 
noted in the statistic data to see whether the replaced value in the counterfeit group caused more or less 
information loss. 
Figure 2. Work flow model to avoid similarity attack 


3.2. Algorithm 

Figure 3 gives the algorithm for the proposed method with big data as input. For each tuple set 
of the streaming big data input [20], the source is recorded. This makes it possible to trace the source of the 
data in case of an adversary attack. 


ILT Information loss threshold 
Calculate Counterfeit and create groups 
If counterfeit values repeated more among groups 
End IF 
Figure 3. Algorithm for proposed method 


The incoming stream data may not be in a suitable format, so preprocessing is used to convert 
it. The preprocessing steps are as follows. 
a. Read the URL or address of the streaming data source. 
b. Load the raw data into the dataset file. 
c. Read the first line of attributes in the file and split it by the delimiter. 
d. Convert the split data of the first line into columns. 
e. Read the file's data into a buffer line by line up to the end of file and convert each line into a tuple. 
f. Split the data stream using the delimiter and insert the values into the columns. 
g. Identify the quasi and sensitive identifiers in the data table. 
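
The preprocessing steps can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name `preprocess`, the column names and the in-memory sample that stands in for the streaming source are all assumptions.

```python
import csv
import io

def preprocess(stream, delimiter, quasi, sensitive):
    """Turn delimited text into tuples and label each column's role.

    Hypothetical helper: `stream` is any iterable of text lines, and
    `quasi` and `sensitive` are sets of column names supplied by the caller.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    # The first line holds the attribute names and becomes the columns.
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines in the raw data
            continue
        # Each remaining line becomes one tuple keyed by column name.
        rows.append(dict(zip(header, (f.strip() for f in fields))))
    # Mark each column as quasi-identifier, sensitive or other.
    roles = {}
    for name in header:
        if name in quasi:
            roles[name] = "quasi"
        elif name in sensitive:
            roles[name] = "sensitive"
        else:
            roles[name] = "other"
    return header, rows, roles

# An in-memory sample stands in for the streaming source address.
sample = io.StringIO("age,zip,disease\n34,411001,malaria\n\n51,411002,dengue\n")
header, rows, roles = preprocess(sample, ",", {"age", "zip"}, {"disease"})
```

In a real deployment the `stream` argument would be fed from the source address read in the first step; the rest of the function does not depend on where the lines come from.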

After preprocessing, the data is available in the proper format. The Anatomy [21] technique divides 
input tuples into two parts. Counterfeit values are added to form the groups. If the counterfeit values 
are repeated many times across the groups, synthetic values can be used to replace the repeated values; 
otherwise past data is sufficient to form the groups of counterfeit values. For each group of counterfeit 
values, the similarity index of the group is calculated. If values are similar, they are replaced 
with other values from the past data. Late validation is done by maintaining the group count and the released 
tuples in the group. Statistic data on information loss and utility measures is maintained. If the information 
loss ratio exceeds the threshold value, the process is repeated with different values in the group. 
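
The replacement of similar counterfeit values can be sketched as below. This is a simplified illustration under stated assumptions: the function `replace_similar`, the pairwise `similarity` callable, the similarity threshold and the toy similarity table are all hypothetical, and the information loss check against ILT is not modelled because the text does not define the loss measure.

```python
def replace_similar(group, real, past_values, similarity, threshold):
    """Replace counterfeit values that are too similar to the real value.

    Hypothetical helper: a counterfeit whose similarity to the real value
    reaches `threshold` is swapped for a past value that is not yet in the
    group and is itself below the threshold.
    """
    used = set(group)
    result = []
    for value in group:
        if value != real and similarity(real, value) >= threshold:
            replacement = next(
                (p for p in past_values
                 if p not in used and similarity(real, p) < threshold),
                None,
            )
            if replacement is not None:  # keep the old value if nothing fits
                used.add(replacement)
                value = replacement
        result.append(value)
    return result

# Toy similarity table; the numbers are illustrative, not measured.
table = {
    frozenset(("malaria", "dengue")): 0.6,
    frozenset(("malaria", "typhoid")): 0.5,
}

def toy_similarity(a, b):
    return table.get(frozenset((a, b)), 0.0)

new_group = replace_similar(
    ["malaria", "dengue", "leprosy"], "malaria",
    ["typhoid", "asthma"], toy_similarity, 0.4,
)
```

Here "dengue" is too close to the real value "malaria", the first past value "typhoid" is rejected for the same reason, and "asthma" is used instead.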




4. DISTRIBUTED EXECUTION FLOW 

If an organization does not have enough processing capability and infrastructure 
to analyze a large amount of data, the stream data is handed to a third party. In such a situation existing 
methods do not provide enough privacy. Figure 4 shows the distributed execution flow for big 
streaming data. In the delay-free anonymization method, L-diverse counterfeit values are generated when a 
new tuple arrives. These values are drawn from past data (the domain of sensitive values). For big data, 
millions of tuples arrive in one session and counterfeit values are generated randomly [20]. Similar values 
may therefore be generated in a group, which may expose the patient data to a similarity attack. 
Figure 4. Distributed Execution flow for big streaming data to avoid similarity attack 


When similar values are generated in a group, they can be replaced 
with other sensitive values so that the similarity attack is avoided. At the same time, values repeated 
among the groups are found and replaced with synthetic values. The vertical dotted lines 
in Figure 4 show execution on different nodes in a distributed fashion. While tuples are anonymized and 
published on the first node, the second node is used for group data formation and replacement of similar 
values. The third node keeps statistic data on the information loss caused by replaced or synthetic values 
in a group. 

The domain of sensitive values contains a limited number of values, so these values are repeated. For 
N records, N/L groups of counterfeit values are generated. For example, 500 records with 
L=10 generate 50 groups. With big data as input, for example 500000 records and L=10, 
50000 groups are generated. In each group the counterfeit sensitive values are repeated. If the sensitive 
domain has 50 unique values, then for 50000 groups the 50 values are repeated 1000 times in different 
groups. To avoid this repetition of sensitive domain values in the groups, a few values can be replaced with 
synthetic values. The probability of disclosing the real sensitive value increases when sensitive 
values are repeated across groups. Creating groups of counterfeit values for millions of records, and finding 
repeated or similar values in those groups, both in a very short time, requires executing this work in a 
distributed or parallel fashion. 
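
The arithmetic in this paragraph can be checked with a few lines of Python. The helper name `group_statistics` is hypothetical. The repetition figure is taken here as the number of groups divided by the number of unique sensitive values, which reproduces the 1000 quoted above; other ways of counting repetition are possible.

```python
def group_statistics(n_records, l, domain_size):
    """Return (number of groups, groups per unique sensitive value)."""
    groups = n_records // l                   # N / L groups of counterfeit values
    groups_per_value = groups // domain_size  # repetition ratio used in the text
    return groups, groups_per_value

small = group_statistics(500, 10, 50)     # -> (50, 1)
big = group_statistics(500000, 10, 50)    # -> (50000, 1000)
```

The ratio grows linearly with the number of records while the sensitive domain stays fixed, which is why repetition becomes a concern only at big data scale.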


5. RESULTS AND ANALYSIS 

To process the big streaming data, we used task level parallelism and data level 
parallelism. Parallelism is applied to tasks such as reading streamed data from the source, preprocessing 
the streamed data, and counterfeit and loss management. The stream data is processed on 
the Flink data processing engine [22], which supports big data stream processing as well as batch data 
processing. The Flink engine also supports complex event processing, machine learning and graph 
analysis. Table 1 shows the similarity values of the sensitive value of a tuple in different groups, obtained 
by processing this data in parallel using different measures. 

Table 1 gives the similarity calculated between the values of different groups. When a tuple arrives, it is 
released after counterfeit values are added to its group. Table 1 shows the similarity of the sensitive values 
dengue, leprosy, malaria and diphtheria with the other counterfeit values. The similarity results are 
obtained using different measures [23]. 


Table 1. Similarity Values for Different Groups Using Different Measures 


Path length          0.1429  0.2     0.1     0.1     [?]     [?]     0.11    0.125   0.1429  0.1429  0.1     [?]     0.1     [?]     0.111   0.125
Jiang & Conrath      0.1618  0.3012  0.1275  0.1275  0.1672  0.2172  0.205   0.2284  0.181   0.1611  0.1392  0.1496  0.1632  0.2105  0.19    0.21
Conceptual distance  0.1429  0.2     0.1     0.1     0.1     [?]     0.11    0.125   0.1429  0.1429  0.1     [?]     0.1     [?]     [?]     0.125
Lin                  0.2361  0.621   0.195   0.206   0.2422  0.4629  0.4629  0.5164  0.257   0.2354  0.2101  0.1223  0.2378  0.4552  0.482   0.500


The Wu & Palmer, Path length, Jiang & Conrath, Conceptual distance and Lin measures are used to find 
the similarity of values in the group [23]. Based on these measures, a similar value 
can be replaced with another value in that group to avoid the similarity attack. Measures for four groups 
with the real sensitive values dengue, leprosy, malaria and diphtheria are shown in Table 1. Figure 5 
compares the similarity of the real sensitive value with the group values. 
Figure 5. Graph based on similarity in different group using different measures 
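
Two of the measures named above have simple closed forms on a taxonomy, and the sketch below shows them on a small tree. It is a hedged illustration only: the taxonomy is a hypothetical toy hierarchy rather than the resource used in [23], so the numbers it produces are not those of Table 1. Path-length similarity is taken as 1/(1 + number of edges on the shortest path) and Wu & Palmer as 2*depth(lcs)/(depth(a) + depth(b)), with the root at depth 1. Jiang & Conrath and Lin are omitted because they need corpus-based information content.

```python
# Hypothetical toy taxonomy (child -> parent), used only for illustration.
PARENT = {
    "infectious disease": "disease",
    "viral disease": "infectious disease",
    "bacterial disease": "infectious disease",
    "parasitic disease": "infectious disease",
    "dengue": "viral disease",
    "leprosy": "bacterial disease",
    "diphtheria": "bacterial disease",
    "malaria": "parasitic disease",
}

def path_to_root(node):
    """Return the chain from the node up to the root, node first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Number of nodes from the root down to the node (root has depth 1)."""
    return len(path_to_root(node))

def lowest_common_subsumer(a, b):
    """Deepest ancestor shared by both nodes."""
    ancestors_b = set(path_to_root(b))
    return next(n for n in path_to_root(a) if n in ancestors_b)

def path_similarity(a, b):
    """1 / (1 + edges on the shortest path between a and b)."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1.0 / (1 + edges)

def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

close_pair = wu_palmer("leprosy", "diphtheria")  # share "bacterial disease"
far_pair = wu_palmer("dengue", "malaria")        # share only "infectious disease"
```

In this toy tree the two bacterial diseases score higher than the viral and parasitic pair, which is the kind of signal that marks a counterfeit value for replacement.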


6. CONCLUSION 

A privacy preservation framework that avoids the similarity attack in electronic health streams is proposed. 
Similarity measures are used to find the similarity of the sensitive value with the counterfeit values. 
Similar counterfeit values are replaced using past tuple data to increase data utility. For big 
streaming data, synthetic values are used to replace counterfeit values repeated among groups. The 
anonymization delay of the framework is reduced using distributed execution. 